# Video understanding

Vjepa2 Vitl Fpc64 256

V-JEPA 2 is a video understanding model developed by Meta's FAIR team. It extends the pre-training objectives of V-JEPA and is described as having state-of-the-art video understanding capabilities.

Video Processing

Internvl3 8B Hf

InternVL3 is a series of multimodal large language models with strong multimodal perception and reasoning capabilities. It supports image, video, and text inputs.

Transformers Other

Internvl3 2B Hf

InternVL3-2B is a multimodal large language model packaged for the Hugging Face Transformers library. It performs well on multimodal tasks involving image, video, and text processing, and supports multiple input types and efficient batch inference.

Transformers Other

Smolvlm2 2.2B Instruct

SmolVLM2-2.2B is a lightweight multimodal model designed for analyzing video content. It can process video, image, and text inputs and generate text outputs.

Transformers English

Xgen Mm Vid Phi3 Mini R V1.5 32tokens 8frames

xGen-MM-Vid (BLIP-3-Video) is an efficient and compact vision-language model equipped with an explicit temporal encoder, specifically designed to understand video content.

Safetensors English

Videomae Base Finetuned Subset

A video understanding model fine-tuned from MCG-NJU/videomae-base on an unknown dataset, with a reported accuracy of 67.13%.

Video Processing

© 2025 AIbase